ABSTRACT


Voice Conversion is defined as modifying the speech signal of one speaker (source 
speaker) so that it sounds as if it had been pronounced by a different speaker (target 
speaker). In this thesis, we present a method for voice conversion by representing the 
joint probabilistic acoustic space of the two speakers with a Mixture of Factor Analyzers 
(MFAs). This can also be interpreted as a reduced dimension mixture of Gaussians. 

Most of the existing voice conversion systems are trained on aligned LSF vectors. 
However, there are many applications of voice conversion systems where the amount 
of training data from the source speaker and the target speaker is different. The 
amount of source data is large, but it is desired to estimate the transformation with a 
small amount of target data. The extra unaligned source data is incorporated into the 
training phase to estimate the parameters of the MFA and hence improve performance. 

Objective experiments demonstrate that the performance of the proposed system
using factor analyzers is comparable to that of existing systems based on
Gaussian mixture models, with significant gains in both time and memory
complexity. Adding unaligned data in the training phase substantially improves
conversion performance. Subjective tests suggest that small increments in the
dimension of the factor analyzers make no perceptible difference to the listener.
Chapter 1


Introduction 

Speech signals convey a wide range of information, of which the meaning of
the message being uttered is of prime importance. However, secondary information
such as speaker identity also plays an important part in oral communication. Voice 
modification techniques attempt to transform the speech signals uttered by a given 
speaker so as to alter the characteristics of his or her voice. This problem - how to 
modify the speech of one speaker so that it sounds as if it was uttered by another 
speaker - is generally known as voice conversion (VC) [1]. 

In daily life, the individuality of voices allows us to distinguish between different
speakers. Speaker identity also makes it possible to differentiate between speakers in
a conference call or on a radio program. Consequently, there are a number of useful
applications for controlling the speaker identity by means of a VC system, especially
when integrated into other speech systems with either synthetic or natural speech
output.

An example application is the integration of a VC system with a text-to-speech
(TTS) synthesizer. Today’s state-of-the-art TTS systems are based on a concatenative
synthesis method in which a system retrieves natural speech segments from a database
and joins them together to generate a new utterance. The synthesis database contains
an organized collection of carefully recorded speech, and the speaker identity of the
synthesis output resembles the identity of the database speaker. Creating a synthesis
database for a new synthesis voice is a significant recording and labelling effort and
requires a significant amount of computational resources.

Using VC technology, new synthesis voices can be created by novice users quickly
and inexpensively by creating a “speaker model” from a small number of speech
utterances produced by the desired target speaker. The speaker model describes the
characteristics of the target speaker’s voice. Using different speaker models, the
synthesis system can generate speech signals with different speaker identities from a
single speaker database, which plays the role of the source speaker [2, 3, 4].

Another application is in the area of very-low-bandwidth coding of speech. Speech 
coding systems that are designed to operate at 2400 bps or less do not preserve speaker 
identity during transmission [6]. For these systems, VC algorithms have the potential 
to render the decoded speech at the receiver so that it matches the speaker identity of 
the transmitting speaker. 

Voice conversion systems can also be used in language interpreters and in
cross-language voice conversion [7]. Researchers have also considered VC systems for
rendering acoustically impaired speech more intelligible [10, 9].


1.1 General Voice Conversion System 


In voice conversion, we map the acoustic features of a source speaker to those of a 
target speaker. Figure 1.1 illustrates the typical model for a voice conversion system. 



Figure 1.1: General Voice Conversion System 


We collect speech in a parallel training corpus from both the source and target speakers
for use in training the model. After training is complete, we predict what the target
speaker sounds like using the information from new speech of the source speaker.

A typical voice conversion system consists of seven components: the speech corpus,
time alignment of the speech, analysis, training, voice mapping, synthesis, and
post-processing.

Speech corpus - We must collect the same speech from a set of speakers. This speech
should contain an even distribution of the phonemes of the speakers' language so that we
can model a variety of sounds; this variety improves the quality of the voice conversion
system. In this work, we use the Arctic Corpus for training and testing [11].

Alignment of phonemes - We should have a robust method for time aligning the
source and target speakers' speech. Exact time alignment before training is necessary
for optimal performance of this particular voice conversion system.

Analysis - We must determine the relevant acoustic features to use in training the
system. These features should be able to represent a large portion of speech with only
a few parameters. We discuss the features we use for representing speech in §2.3 and
§2.5. Examples of typical features are the pitch or energy of a short segment of speech.

Training - We must choose an appropriate model for training the voice conversion
system.

Voice Mapping - We perform the mapping of the source speaker's features to the target
speaker's. Usually, this mapping is a statistical expectation. Training and mapping are
the crucial items that we focus on in this thesis, and they are also the subject of most
research. We mention a few of the previously employed techniques in §2.7.

Synthesis - We must synthesize the transformed features into high quality speech.

Post-processing - After synthesis, we might perform some final processing of the signal
to improve its quality.
1.2 Thesis Outline

The remaining chapters are organized as follows:

Chapter 2 introduces the mathematical model for human speech production and 
presents the features with which we represent speech. 

Chapter 3 provides an introduction to a previous model for voice conversion. We 
then discuss the present research, a method of modelling the probabilistic acoustic 
space of both speakers with a Mixture of Factor Analyzers and performing conversion 
with this model. We also discuss a method to improve the performance of the voice 
conversion system by using unaligned data. 

Chapter 4 presents the objective and subjective results of our voice conversion 
system. It highlights the tradeoff of performance versus complexity that our system 
offers. 

Chapter 5 concludes and discusses future opportunities for research in voice
conversion.



Chapter 2 


Theoretical Background 

In order to develop an effective voice conversion system, it is important to understand
the fundamental properties of speech. This chapter provides background on how
speech is produced and how its acoustics are modelled mathematically, describes the
different speaker characteristics, and gives a brief review of previous methods for
voice conversion.

2.1 Physiology of Speech Production 

The main speech organs of the human speech production system are shown in Figure
2.1. Speech is produced by a part of the human anatomy called the vocal tract, which
begins at the vocal cords, or glottis, and ends at the lips.

Figure 2.1: Human Speech Production

Air enters the lungs via the normal breathing mechanism. As air is expelled from
the lungs through the trachea (or windpipe), the tensed vocal cords within the larynx
are caused to vibrate by the air flow. The air flow is chopped into quasi-periodic pulses
which are then modulated in frequency in passing through the pharynx (the throat
cavity), the mouth cavity, and possibly the nasal cavity. Depending on the positions
of the various articulators (i.e., jaw, tongue, velum, lips, mouth), different sounds are
produced. Consequently the air flow is the source for four types of sounds [13, 12]:
Aspiration noise - The sound of air rushing through the entire vocal tract, similar 
to breathing through the mouth. 

Frication noise - The sound of turbulent flow at a point of narrow constriction, for 
example during the initial sound in “fair” . 

Plosion - The sound of an initial burst, for example during the initial consonant in 
“ton”. 

Voicing - A quasi-periodic vibration of the vocal cords or glottis, for example during
the vowel in “key”. The frequency of vibration is called the fundamental frequency or
F0 and is perceived as pitch.

The four types of sounds can occur in combination. For example, the initial sound
in “vault” combines frication noise with voicing. Examples of voiced sounds in
English are the sounds produced when pronouncing the vowels a, e, i, o
and u. Unvoiced sounds are produced by forcing air from the lungs through a
constriction formed at some point in the vocal tract. Unvoiced utterances are noisy in
nature; examples are the sounds produced when saying the English consonants s and f.

2.2 Speaker characteristics 

The acoustic speech signal contains many types of information. Primarily, the signal
carries information about the message (what was said), but it also includes information
about the speaker (who said it) and the environment (where it was said). Speaker
characteristics describe the aspects of speech that are related to the person who produced
it, independent of the message and the environment. The task of a voice conversion
system is thus to change the speaker characteristics of a speech signal while preserving
the other types of information. The characteristics of a speech signal are commonly
divided into the following types of cues:

Segmental cues - These describe the “sound” or “timbre” of the speaker’s voice.
Acoustic descriptors of segmental cues include formant locations and bandwidths,
spectral tilt, F0 and energy. Segmental cues depend mainly on the physiological
and physical properties of the speech organs and the speaker’s emotional state [15].

Figure 2.2: Speech Signals and Spectrograms for Two Speakers


Suprasegmental cues - These describe the prosodic features related to the style of
speaking, for example the duration of phonemes and the evolution of F0 (intonation)
and energy (stress) over an utterance. The average behavior of phoneme duration, F0,
and energy are perceived as rate of speech, average pitch, and loudness. These cues are
influenced by social and psychological conditions [14].

Linguistic cues - These include particular choices of words, dialects and accents.
Linguistic cues are beyond the scope of this thesis and will not be considered.

We illustrate some of the segmental and suprasegmental cues by considering the
differences between two speakers in an example. Figure 2.2 shows the waveforms
and spectrograms for two speakers uttering the sentence “For the twentieth time
that evening, the two men shook hands”. We see how the magnitudes of the formants
change over time with respect to the different sounds uttered by the two speakers.
One difference in suprasegmental cues is manifested in the different durations
of the same speech spoken by the two speakers. Suprasegmental cues
can be changed easily by the speaker. However, segmental cues are closely linked to the
physiology of the speech production organs and can thus be considered immutable.

2.3 Acoustic Filter Model 

Speech is a highly non-stationary process; i.e., the statistics of the underlying signal 
vary with respect to time. But, for short segments of time, speech is either quasi- 
periodic or noisy. So we can assume that speech is wide-sense stationary - the first 
and second order statistics remain constant during these short segments. Thus, for 
these short segments of time, we can form a tractable model to represent the speech 
production process as described in [16]. 

A powerful method for modelling a discrete-time system for a short segment of
time is the Linear Predictive Coding (LPC) model. The basic idea behind the LPC
model is that a given speech sample at time n, s(n), can be approximated as a linear
combination of the past p speech samples, such that

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \tag{2.1}$$

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed constant over the speech analysis
frame. Equation 2.1 can be converted into an equality by including an excitation term,
G u(n), giving:

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n) \tag{2.2}$$

where u(n) is a normalized excitation and G is the gain of the excitation. Equation
2.2 can be expressed in the z-domain as

$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z) \tag{2.3}$$

leading to the transfer function

$$H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)} \tag{2.4}$$

The normalized excitation source, u(n), is scaled by the gain, G, and acts as the input
to the all-pole system, H(z) = 1/A(z), to produce the speech signal, s(n). Based on prior
knowledge that the actual excitation function for speech is essentially a quasi-periodic
pulse train (for voiced speech sounds) or a random noise source (for unvoiced sounds),
the appropriate synthesis model for speech, corresponding to the LPC analysis, is as
shown in Figure 2.3 [17].

Figure 2.3: Speech synthesis model based on LPC model

This view of speech production is very powerful and it can explain the majority
of speech phenomena. The normalized excitation source is chosen by a switch whose
position is controlled by the voiced/unvoiced character of the speech, which chooses
either a quasi-periodic train of pulses as the excitation for voiced sounds, or a random
noise sequence for unvoiced sounds. The appropriate gain G of the source is estimated
from the speech signal, and the scaled source is used as input to the digital filter
(H(z)), which is controlled by the vocal tract parameters characteristic of the speech
being produced. The parameters of the model are thus the voiced/unvoiced classification,
the pitch period for voiced sounds, the gain parameter, and the coefficients of the digital
filter, $\{a_k\}$. The model described above has the advantage that the computation of the
$a_k$ coefficients is easily tractable with Levinson’s algorithm.
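
The synthesis model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `synthesize_frame`, the ideal unit impulse train standing in for the quasi-periodic pulse source, the unit-variance Gaussian noise source, and the `pitch_period` and `seed` parameters are all hypothetical choices made for the example. The recursion itself is Equation 2.2 with zero initial conditions.

```python
import random


def synthesize_frame(a, gain, n_samples, voiced, pitch_period=80, seed=0):
    """Generate one frame with the all-pole model s(n) = sum_i a_i s(n-i) + G u(n).

    a            -- predictor coefficients [a_1, ..., a_p]
    gain         -- excitation gain G
    voiced       -- True: impulse-train excitation, False: white-noise excitation
    pitch_period -- impulse spacing in samples (assumed value, voiced frames only)
    """
    rng = random.Random(seed)
    p = len(a)
    s = []
    for n in range(n_samples):
        # The "switch" of the synthesis model: choose the excitation source.
        if voiced:
            u = 1.0 if n % pitch_period == 0 else 0.0
        else:
            u = rng.gauss(0.0, 1.0)
        # Linear combination of up to p past samples (s(n) = 0 for n < 0).
        past = sum(a[i] * s[n - 1 - i] for i in range(min(p, n)))
        s.append(past + gain * u)
    return s


if __name__ == "__main__":
    frame = synthesize_frame([0.5], gain=1.0, n_samples=6, voiced=True, pitch_period=4)
    print(frame)
```

With a single coefficient $a_1 = 0.5$, each impulse produces a geometrically decaying response, which is the impulse response of H(z) = 1/A(z) for that filter.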

2.4 The Levinson-Durbin Algorithm 

In this section we provide a brief description of the Levinson algorithm, which was
formulated by Levinson in 1947 [18], with performance improvements made by Durbin [19]
for the specific problem of a time series, as we have with our speech model in Equation
2.2. Levinson’s algorithm runs in $O(p^2)$ time, compared with older methods which run
in $O(p^3)$ time.

In order to present the Levinson algorithm we first consider the linear combination
of past speech samples as the estimate $\hat{s}(n)$, defined as

$$\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k) \tag{2.5}$$

The prediction error e(n) is defined as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \tag{2.6}$$

To set up the equations that must be solved to determine the predictor coefficients,
we define short-term speech and error segments at time n as

$$s_n(m) = s(n+m) \tag{2.7}$$

$$e_n(m) = e(n+m) \tag{2.8}$$

and we seek to minimize the mean squared error signal at time n

$$E_n = \sum_{m} e_n^2(m) \tag{2.9}$$

which can be written as

$$E_n = \sum_{m} \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2 \tag{2.10}$$

To solve Equation 2.10 for the predictor coefficients, $E_n$ is differentiated with respect
to each $a_k$ and set to zero,

$$\frac{\partial E_n}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p \tag{2.11}$$
k=l 
yielding

$$\sum_{m} s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} a_k \sum_{m} s_n(m-i)\, s_n(m-k) \tag{2.12}$$

Expression 2.12 can be written as

$$\phi_n(i, 0) = \sum_{k=1}^{p} a_k \phi_n(i, k) \tag{2.13}$$

which describes a set of p equations in p unknowns. It may be observed that the terms
of the form $\sum_m s_n(m-i)\, s_n(m-k)$ are terms of the short-term covariance of
$s_n(m)$, i.e.

$$\phi_n(i, k) = \sum_{m} s_n(m-i)\, s_n(m-k) \tag{2.14}$$

The minimum squared error, $E_n$, can be expressed as

$$E_n = \sum_{m} s_n^2(m) - \sum_{k=1}^{p} a_k \sum_{m} s_n(m)\, s_n(m-k) \tag{2.15}$$

It can be shown, under the assumption that the speech segment $s_n(m)$ is identically
zero outside the range $0 \le m \le N-1$, that $\phi_n(i, k)$ defined above is identical
to the short-time autocorrelation function $r_n$,

$$\phi_n(i, k) = r_n(i-k) \tag{2.16}$$

where

$$r_n(k) = \sum_{m=0}^{N-1-k} s_n(m)\, s_n(m+k) \tag{2.17}$$

Since the autocorrelation function is symmetric, i.e. $r_n(-k) = r_n(k)$, the LPC
equations can be expressed as

$$\sum_{k=1}^{p} r_n(|i-k|)\, a_k = r_n(i), \qquad 1 \le i \le p \tag{2.18}$$
The optimal solution for $\{a_k\}$ is well known as the Wiener-Hopf solution, and we
can use Durbin’s method to solve for the coefficients recursively. The algorithm finds
the $p$-th order solution for $\{a_k\}$ by using the $(p-1)$-th order solution. We give a
brief description of the algorithm below.

$$E^{(0)} = r(0) \tag{2.19}$$

$$k_i = \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(|i-j|)}{E^{(i-1)}}, \qquad 1 \le i \le p \tag{2.20}$$

$$a_i^{(i)} = k_i \tag{2.21}$$

$$a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \tag{2.22}$$

$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \tag{2.23}$$

The above equations are solved recursively for i = 1, 2, ..., p. The final solution is given
as

$$a_m = \text{LPC coefficients} = a_m^{(p)}, \qquad 1 \le m \le p \tag{2.24}$$

The algorithm exploits the Toeplitz structure of the autocorrelation matrix, leading to
significant gains in computational efficiency.
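
To make the recursion concrete, the following is a minimal pure-Python sketch of the autocorrelation method followed by the Levinson-Durbin recursion. The function names `autocorrelation` and `levinson_durbin` are illustrative choices for this example, and the code assumes the sign convention of the predictor in Equation 2.2; it is a sketch rather than the exact routine used in this work.

```python
def autocorrelation(frame, p):
    """Short-time autocorrelation r(k) = sum_{m=0}^{N-1-k} s(m) s(m+k), k = 0..p."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k)) for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve sum_k r(|i-k|) a_k = r(i), 1 <= i <= p, in O(p^2) operations.

    Returns (a, k, err): predictor coefficients a_1..a_p, the intermediate
    coefficients k_1..k_p, and the final prediction error E^(p).
    """
    a = [0.0] * (p + 1)   # a[j] holds a_j of the current order; index 0 is unused
    k = [0.0] * (p + 1)
    err = r[0]            # E^(0) = r(0)
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err
        new_a = a[:]
        new_a[i] = k[i]                        # a_i^(i) = k_i
        for j in range(1, i):
            new_a[j] = a[j] - k[i] * a[i - j]  # order update of the lower coefficients
        a = new_a
        err = (1.0 - k[i] ** 2) * err          # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], k[1:], err


if __name__ == "__main__":
    # Autocorrelation of an idealized first-order process: r(k) = 0.5 ** k.
    coeffs, refl, err = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, refl, err)
```

For the autocorrelation sequence r(k) = 0.5^k, a second-order fit recovers $a_1 = 0.5$ and $a_2 = 0$, since the first-order predictor already explains the sequence.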

2.5 Line Spectral Frequencies 

Line spectral frequencies (LSFs) are an alternative representation of the LPC
coefficients [20]. LSFs have several desirable properties, which are discussed in this
section.

Two polynomials P(z) and Q(z) of order p+1 are formed from the order p prediction
error filter A(z) in the following manner.

$$P(z) = A(z)\left(1 + z^{-(p+1)} \frac{A(z^{-1})}{A(z)}\right) = A(z)\,(1 + G(z)) \tag{2.25}$$

$$Q(z) = A(z)\left(1 - z^{-(p+1)} \frac{A(z^{-1})}{A(z)}\right) = A(z)\,(1 - G(z)) \tag{2.26}$$

with

$$G(z) = z^{-(p+1)} \frac{A(z^{-1})}{A(z)} \tag{2.27}$$

Therefore A(z) can be written as

$$A(z) = \frac{P(z) + Q(z)}{2} \tag{2.28}$$

The LSFs are defined as those values of frequency $\omega$ such that

$$\{\omega : P(e^{j\omega}) = 0 \ \text{or} \ Q(e^{j\omega}) = 0; \ 0 < \omega < \pi\} \tag{2.29}$$

Thus LSFs are the frequency values associated with the unit-magnitude zeros of P(z) or
Q(z) [21]. The important properties of LSFs are enumerated below.

1. If A(z) is minimum phase, then all zeros of P(z) and Q(z) are on the unit circle.

2. The LSFs $\omega_p$ of P(z) and the LSFs $\omega_q$ of Q(z) are interleaved with one
another, i.e.,

$$0 < \omega_{p_1} < \omega_{q_1} < \cdots < \omega_{p_i} < \omega_{q_i} < \cdots < \pi \tag{2.30}$$

This interleaving property provides an easy way to verify the stability of the
underlying synthesis filter.

3. The LSFs have good interpolation properties and quantize well because they are
more evenly distributed than LPCs [22].

4. Two LSFs that are close in value correspond to a peak in the spectrum, and the 
peak can be interpreted as a formant [23]. LSFs can thus correlate well with the 
formant frequencies that identify a speaker. 

These properties of the LSF are desirable for a voice conversion system and thus 
LSF is the chosen representation for the speech features. 
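
As an illustration of Equations 2.25 to 2.29, the sketch below converts predictor coefficients into LSFs by locating the unit-circle zeros of P(z) and Q(z). It is a simple grid search with bisection and is not the root-finding procedure referenced elsewhere in this thesis; the function name `lpc_to_lsf` and the parameters `n_grid` and `n_bisect` are assumptions made for this example. It relies on the fact that P(z) has symmetric and Q(z) antisymmetric coefficients, so each reduces to a real function of $\omega$ on the unit circle.

```python
import math


def lpc_to_lsf(a, n_grid=1024, n_bisect=50):
    """Return the LSFs in (0, pi) for A(z) = 1 - sum_k a_k z^{-k}.

    Sketch only: a zero that coincides exactly with a grid point, or that lies
    closer than pi / n_grid to 0 or pi, is not detected.
    """
    p = len(a)
    c = [1.0] + [-ak for ak in a] + [0.0]             # A(z) padded to degree p+1
    P = [c[m] + c[p + 1 - m] for m in range(p + 2)]   # A(z) + z^-(p+1) A(1/z)
    Q = [c[m] - c[p + 1 - m] for m in range(p + 2)]   # A(z) - z^-(p+1) A(1/z)
    half = (p + 1) / 2.0

    def f_p(w):  # real function with the same unit-circle zeros as P
        return sum(P[m] * math.cos(w * (half - m)) for m in range(p + 2))

    def f_q(w):  # real function with the same unit-circle zeros as Q
        return sum(Q[m] * math.sin(w * (half - m)) for m in range(p + 2))

    def zeros(f):
        found = []
        grid = [math.pi * i / n_grid for i in range(1, n_grid)]  # interior points
        for lo, hi in zip(grid, grid[1:]):
            f_lo = f(lo)
            if f_lo * f(hi) < 0:                # sign change: refine by bisection
                for _ in range(n_bisect):
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros(f_p) + zeros(f_q))


if __name__ == "__main__":
    print(lpc_to_lsf([0.0, 0.0]))   # flat spectrum, p = 2
```

For the trivial filter A(z) = 1 with p = 2, P(z) = 1 + z^{-3} and Q(z) = 1 - z^{-3}, so the two LSFs are evenly spaced at $\pi/3$ and $2\pi/3$, which is consistent with the interleaving property.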

2.6 Analysis 

This section describes how the speech features are calculated from the speech signal
and the various parameter choices involved in the analysis phase.

We perform the analysis, processing and synthesis of speech by considering a small 
section of speech at a time. The original speech waveform is apportioned into small, 
overlapping frames s”^(n), thus the system is said to be frame based. 
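
A frame-based split of this kind can be sketched as follows. The function name `split_into_frames` and the `frame_length` and `hop_length` parameters are hypothetical; the actual frame length and overlap used in this work are not restated here, and windowing is omitted for brevity.

```python
def split_into_frames(signal, frame_length, hop_length):
    """Split a sequence of samples into overlapping frames.

    Consecutive frames start hop_length samples apart, so they overlap by
    frame_length - hop_length samples. Trailing samples that do not fill a
    complete frame are dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_length <= len(signal):
        frames.append(signal[start:start + frame_length])
        start += hop_length
    return frames


if __name__ == "__main__":
    for frame in split_into_frames(list(range(10)), frame_length=4, hop_length=2):
        print(frame)
```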

Pitch and voiced/unvoiced decisions have to be estimated for each frame. Various
methods have been reported in the literature for estimating pitch and making voicing
decisions in a frame. In this work we follow the method given by Ahmadi and Spanias [24].
The LPC filter coefficients are calculated using Levinson’s algorithm, described
in §2.4.

Figure 2.4 shows the effect of the LPC prediction order, p, on the RMS prediction
error $e_n$, for sections of both voiced and unvoiced speech [26]. It is seen that
the prediction error for unvoiced speech is larger than that for voiced speech. The
result is intuitive, as unvoiced speech is less linearly predictable than voiced speech.
For prediction orders greater than 12 the curve is relatively flat, so further increases
in order result in a less parsimonious representation of the sound. We have therefore
used a prediction order of 16 in this work. The resulting set of LPCs
$\{a_i\}_{i=1}^{p}$ is converted into the alternative LSF representation with the aid of
a root-finding procedure proposed by Soong and Juang [25].

Figure 2.4: Variation of RMS prediction error with the number of prediction
coefficients, p

2.7 Previous Methods for Voice Conversion 

Most existing voice conversion systems employ the methodology above and that
described in §1.1. In this section we give a brief review of previous methods for voice
conversion.

Codebook Mapping 

One of the earliest approaches to the voice conversion problem is the mapping codebook
approach of Abe et al. [8], which was originally introduced by Shikano et al. for speaker
adaptation [28]. In Abe’s approach, a clustering procedure - vector quantization (VQ) -
is applied to the spectral parameters of both the source and the target speakers. The
two resulting VQ codebooks are used to obtain a mapping codebook whose entries
represent the transformed spectral vectors corresponding to the centroids of the source
speaker codebook. The main shortcoming of this method is that the parameter
space of the converted envelope is limited to a discrete set of envelopes, causing a drop
in the quality of the converted speech. Arslan [30] extended this work by mapping
not only the LSFs, but also the excitation; in addition, he improved the method by
which the transformation weights were estimated. He named his method for updating
the weights Codebook Weight Update by Gradient Descent; it is an optimization
technique that iterates until the energy of the weight errors falls below a threshold. He
integrated this method into his larger framework for voice conversion called Speaker
Transformation Algorithm using Segmental Codebooks (STASC). Turk extended the
STASC framework by integrating subband voice conversion using wavelets and selective
pre-emphasis [31, 32]. Orphanidou [33] utilized the Generative Topographic Mapping
[35] to reduce the dimensionality when mapping codebooks.
Mixture Modelling

Stylianou modelled the acoustic probability space of the source speaker with a Gaussian
Mixture Model (GMM) in [36, 37]. He then found the cross-covariance of the source
and target vectors and the mean target vector using least squares optimization of an
overdetermined set of linear equations. In his work, he demonstrated the theoretical
superiority of the GMM by showing that vector-quantization methods for voice
conversion are a special case of the GMM in which only the mean of a cluster is mapped.
Toda [38] implemented the GMM algorithm for voice conversion within the STRAIGHT
analysis-synthesis framework. Kain extended Stylianou’s work by modelling the joint
probability density function of both the source and target speakers [5, 3]. This method
obviates the need to perform the least squares optimization required by Stylianou’s
method. Modelling the joint probability density allows the system to capture all possible
correlations between the source and target speakers' spectra. Various enhancements to
Kain’s method have also been proposed by Young [39]. Mark Wilde [34] utilized
probabilistic principal component analysis to solve the problem. Recently, Mouchtaris has
proposed a novel non-parallel training algorithm for voice conversion based on
maximum likelihood constrained adaptation [40].

Other Methods for Spectral Conversion 

In [47], Lee used an orthogonal vector space transformation to convert voiced speech
and a non-linear neural net predictor to model the excitation; he converted the
excitation using mapping codebooks. Ning Bi used linear multivariate regression for
mapping [9]. Watanabe used radial basis functions for performing the spectral mapping
[44], and Narendranath used neural networks [45] with success. Salor [61] employed
a least mean square adaptive filtering technique to filter the target speaker's features
from the source speaker's. Although mapping with codebooks may be less expensive
computationally, this method is less robust than mixture modelling, as the reduction
of a continuous spectral space into a discrete codebook introduces quantization noise,
leading to a degradation in the quality of the converted speech.

2.8 Thesis and Proposed Method 

We extend the spectral mapping aspect of voice conversion by modelling the joint
probability space of both speakers with a Mixture of Factor Analyzers (MFA). Previous
methods that used the GMM to model the space are constrained to only two possible
selections for representing the covariance structure - diagonal and full covariance
matrices. With a diagonal structure, training is quick but conversion performance
is sacrificed. With full covariances, we can model the underlying second order statistics
with improved conversion performance, but incur the penalty of longer training time.

By modelling the covariance structure with a Mixture of Factor Analyzers, we provide
an entire range of covariance structures for the user to choose from, depending on the
desired quality of synthesized speech. Existing systems for voice conversion require
time-aligned data from the source speaker and the target speaker. We also discuss a
method to improve the performance of the voice conversion system using Mixtures of
Factor Analyzers by including extra unaligned data in the training phase. Objective and
subjective tests are then presented, evaluating the proposed method against the
system using the GMM.



Chapter 3 


Transforming the Spectral Envelope 

The objective of spectral transformation is to find a statistical mapping between those 
features of the source and target speakers which best represent their vocal tracts. After 
finding this mapping, speech is synthesized using the transformed features. 

In this chapter, we first discuss some pre-processing steps taken before training.
We then describe the baseline system, which has been implemented to represent
the probabilistic acoustic space of both speakers, followed by the method for voice
conversion. This system is designed to transform the spectral envelope of speech by
changing the parameters of an all-pole model, using a transformation function
implemented by a Gaussian mixture regression model. In the later part of the chapter, we
discuss our model, which employs factor analysis to provide a more parsimonious model of
the space with a number of advantages. We derive the conversion function for the case in
which a mixture of factor analyzers is used. We also discuss the use of extra
unaligned data in the algorithm to improve performance.


3.1 Time Alignment

In our frame based system, the features of one frame describe only a small portion of 
speech and thus a sequence of features, or feature stream, represents an entire utterance. 
Because of natural variations in the durations of linguistic units between different 
speakers, the feature streams of the source and the target speaker must be time aligned 
before training the voice conversion model. We use the common method of dynamic 
time warping (DTW) to align the waveforms [17]. 

First, we trim both speakers' waveforms with an algorithm that removes silence
at both the beginning and the end, so that the DTW algorithm has a better initial
alignment [17]. The goal of time alignment is to modify the source and target speech
LSF feature streams in such a way that the resulting feature streams can be thought
of as describing the same phonetic content frame by frame. We achieve alignment
by selectively deleting or repeating frames from the target speaker feature stream to
match the number of source frames. Alternatively, we can avoid deleting any frames
altogether by stretching the shorter region of one speaker to the length of the longer one
of the other speaker. DTW uses a dynamic programming strategy to find this optimal
path. An example of the alignment path is shown in Figure 3.1.

Figure 3.1: Example of an alignment path

After alignment, we collect the aligned LSF feature vectors into N frames of source data

$$X_{p \times N} = \left[\, x_1^{\text{source}} \;\; x_2^{\text{source}} \;\; \cdots \;\; x_N^{\text{source}} \,\right] \tag{3.1}$$

and, respectively, target data

$$Y_{p \times N} = \left[\, y_1^{\text{target}} \;\; y_2^{\text{target}} \;\; \cdots \;\; y_N^{\text{target}} \,\right] \tag{3.2}$$

Beginning and ending silences are not included in the training data sets. An example
of two aligned LSF feature streams for a single sentence is shown in Figure 3.2. The
value of N depends on the amount of training data.

Figure 3.2: Two aligned LSF feature streams
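
The alignment step can be illustrated with a textbook dynamic time warping sketch. This is not the exact procedure of [17]: the Euclidean local distance, the three-way step pattern, and the rule in `align_target_to_source` that keeps the last target frame paired with each source frame are assumptions made for this example, and both function names are hypothetical.

```python
def dtw_path(source, target):
    """Return the minimum-cost alignment path as a list of (source_idx, target_idx)."""
    def dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning the first i source and j target frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both streams to recover the optimal path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


def align_target_to_source(path, n_source):
    """Pick one target frame per source frame, repeating or dropping target frames."""
    chosen = [None] * n_source
    for i, j in path:
        chosen[i] = j   # the last target frame paired with source frame i is kept
    return chosen


if __name__ == "__main__":
    src = [[0.0], [1.0], [2.0]]
    tgt = [[0.0], [0.0], [1.0], [2.0]]
    path = dtw_path(src, tgt)
    print(path)
    print(align_target_to_source(path, len(src)))
```

In this toy example the target stream holds its first value for one extra frame, so the path pairs source frame 0 with two target frames and one of them is dropped when the streams are brought to equal length.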
3.2 Training

After feature extraction and time alignment, we model the joint probability space of
the source and target vectors x and y for all N frames of speech. The purpose of
the training stage is to estimate the parameters of a transformation function so that it
can predict the target speaker features y from the source speaker features x. We predict
the best estimate of the target vector y given the source vector x with the conditional
expectation E[y|x].
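
For intuition only, consider the special case in which x and y are jointly Gaussian with a single component. This is an assumption made for this illustration and not the mixture model developed in this chapter; the partitioned symbols $\mu_x$, $\mu_y$, $\Sigma_{xx}$, $\Sigma_{xy}$, $\Sigma_{yx}$ and $\Sigma_{yy}$ are notation introduced here. In that case the conditional expectation takes the standard closed form of a linear regression of y on x:

```latex
\begin{aligned}
\mu &= \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}, \\[4pt]
E[y \mid x] &= \mu_y + \Sigma_{yx}\, \Sigma_{xx}^{-1}\, (x - \mu_x).
\end{aligned}
```

The predicted target vector is thus the target mean, corrected by the deviation of the source vector from its own mean and weighted by the cross-covariance between the two speakers.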

3.2.1 The Gaussian Mixture Model 

For determining the joint probability density p(x,y), we consider the concatenation of 
the source and target feature vectors as the d-dimensional vector z for ease of notation. 



A mixture model allows the probability distribution of z to be modelled as a 
weighted sum or mixture of M component densities, also referred to as classes [41]. 
This is given by Equation 3.4. 

M 

P(z) = '^p(z/j)P(j) (3.4) 

3=1 

Each P{j) is a mixing weight or prior probability of component j occurring. The 
mixing weight satisfies the constraint that P{j) — 1- 1^^ 'the case of a Gaussian 
mixture model, each component density is a d-variate Gaussian function of the form 

(3.5) 


p(z/i) 
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where and Sj denote the mean and covariance of the component of the mixture 
model. 

The complete Gaussian mixture density is parameterized by the mean vectors, 
the covariance matrices and mixture weights from all component densities. These 
parameters are collectively represented by the notation 

\[ \lambda = \{ P(j),\, \mu_j,\, \Sigma_j \}, \quad j = 1, \ldots, M \tag{3.6} \]

Rather than model the joint probability space of the speakers with a discrete set of 
vectors as do codebook techniques [8, 30], the GMM models this space as a continuous 
probability density. The benefit is that the GMM can model a complex, globally nonlin- 
ear manifold well with a collection of locally linear models that exploit the tractability 
of operations in the Gaussian domain. Previous researchers used GMMs for the
voice conversion problem because of the intuitive notion that the individual component
densities model an underlying set of acoustic classes [41]. It is assumed that
the acoustic space corresponding to a speaker’s voice can be characterized by a set of
acoustic classes representing some broad phonetic events, such as vowels, nasals, or 
fricatives. The GMM has been used successfully for both speaker identification [41] 
and voice conversion [37, 2]. 

Contrary to classification schemes with “hard” class boundaries, data points have 
varying degrees of “membership” to all local models; this is referred to as “soft” par- 
titioning. The conditional probability of a GMM class j given z is derived by the 
application of Bayes’ rule

\begin{align*}
P(j \mid z) &= \frac{p(z, j)}{p(z)} \tag{3.7} \\
            &= \frac{p(z \mid j)\, P(j)}{\sum_{i=1}^{M} p(z \mid i)\, P(i)} \tag{3.8}
\end{align*}

The GMM parameters $\lambda = \{P(j), \mu_j, \Sigma_j\}$, $j = 1, \ldots, M$, are estimated by application
of an Expectation Maximization (EM) algorithm [42, 41], an iterative method for
computing the maximum likelihood parameter estimates. For a sequence of N training
vectors $Z = \{z_1, \ldots, z_N\}$, the GMM likelihood can be written as

\[ p(Z \mid \lambda) = \prod_{n=1}^{N} p(z_n \mid \lambda) \tag{3.9} \]

Initially, the mixing weights P(j) and the means $\mu_j$ are initialized using the K-means
clustering algorithm, and the covariances $\Sigma_j$ are set to the identity matrix. The first
EM iteration is carried out from this initial model.

On each EM iteration, the following reestimation formulas are used which guarantee 
a monotonic increase in the model’s likelihood value: 


Mixture weights:

\[ \hat{P}(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid z_n, \lambda) \tag{3.10} \]

Means:

\[ \hat{\mu}_j = \frac{\sum_{n=1}^{N} P(j \mid z_n, \lambda)\, z_n}{\sum_{n=1}^{N} P(j \mid z_n, \lambda)} \tag{3.11} \]

Covariances:

\[ \hat{\Sigma}_j = \frac{\sum_{n=1}^{N} P(j \mid z_n, \lambda)\, (z_n - \hat{\mu}_j)(z_n - \hat{\mu}_j)^T}{\sum_{n=1}^{N} P(j \mid z_n, \lambda)} \tag{3.12} \]

The a posteriori probability for acoustic class j is given by Equation 3.8. The new
model then becomes the initial model for the next iteration and the process is repeated
until some convergence threshold is reached. Equation 3.12 is the computational
bottleneck because it involves the complete reestimation of the sample covariance
matrix $\hat{\Sigma}_j$ weighted by the posterior component probability. Thus, the EM algorithm’s
complexity with fully populated covariance matrices is $O(NMd^2)$.
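The reestimation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `gaussian_pdf` and `em_step` are hypothetical, the initial parameters are passed in by the caller (for example from K-means), and no safeguards against singular covariances are included.

```python
import numpy as np

def gaussian_pdf(Z, mean, cov):
    """Evaluate a d-variate Gaussian density at each row of Z (shape N x d)."""
    d = Z.shape[1]
    diff = Z - mean
    # Solve cov * a = diff^T rather than forming the explicit inverse.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))

def em_step(Z, weights, means, covs):
    """One EM iteration for a full-covariance GMM.

    Returns the updated weights, means and covariances, together with the
    log-likelihood of the parameters that were passed in.
    """
    N, M = Z.shape[0], len(weights)
    # E-step: joint p(z_n | j) P(j), then posteriors by Bayes' rule.
    joint = np.stack([weights[j] * gaussian_pdf(Z, means[j], covs[j])
                      for j in range(M)], axis=1)            # N x M
    log_lik = np.sum(np.log(joint.sum(axis=1)))
    post = joint / joint.sum(axis=1, keepdims=True)
    # M-step: posterior-weighted reestimation of weights, means, covariances.
    Nj = post.sum(axis=0)
    new_weights = Nj / N
    new_means = (post.T @ Z) / Nj[:, None]
    new_covs = []
    for j in range(M):
        diff = Z - new_means[j]
        new_covs.append((post[:, j, None] * diff).T @ diff / Nj[j])
    return new_weights, new_means, np.array(new_covs), log_lik
```

Calling `em_step` repeatedly and monitoring the returned log-likelihood gives a simple convergence test. The covariance update inside the loop is the step that costs on the order of N d^2 operations per component.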


3.2.2 Spectral Conversion with a GMM 


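Having estimated the parameters of the GMM, we can now estimate the target speaker’s
feature vector y from the source speaker’s feature vector x. The joint covariance matrix
$\Sigma_j$ for the j-th Gaussian component is partitioned as follows.

\[ \Sigma_j = \begin{bmatrix} \Sigma_j^{xx} & \Sigma_j^{xy} \\ \Sigma_j^{yx} & \Sigma_j^{yy} \end{bmatrix} \tag{3.13} \]

where

\[ \Sigma_j^{xx} = E\left[ (x - \mu_j^x)(x - \mu_j^x)^T \right] \tag{3.14} \]

is the $\frac{d}{2} \times \frac{d}{2}$ auto-covariance of the source vector x,

\[ \Sigma_j^{xy} = E\left[ (x - \mu_j^x)(y - \mu_j^y)^T \right] \tag{3.15} \]

is the $\frac{d}{2} \times \frac{d}{2}$ cross-covariance of the source vector x with the target vector y, and
$\Sigma_j^{yx}$ is the $\frac{d}{2} \times \frac{d}{2}$ cross-covariance of the target vector y with the source vector x. It is
the transpose of Equation 3.15. $\Sigma_j^{yy}$ is the $\frac{d}{2} \times \frac{d}{2}$ auto-covariance of the target vector
y:

\[ \Sigma_j^{yy} = E\left[ (y - \mu_j^y)(y - \mu_j^y)^T \right] \tag{3.16} \]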

The expected value of a feature vector y given a feature vector x for a single
Gaussian component is the regression

\begin{align*}
E[y \mid x] &= \int y\, p(y \mid x)\, dy \tag{3.17} \\
            &= \mu^y + \Sigma^{yx} (\Sigma^{xx})^{-1} (x - \mu^x) \tag{3.18}
\end{align*}

Extending to the mixture case, the expectation is

\[ F(x) = E[y \mid x] = \sum_{j=1}^{M} P(j \mid x) \left[ \mu_j^y + \Sigma_j^{yx} (\Sigma_j^{xx})^{-1} (x - \mu_j^x) \right] \tag{3.19} \]

with

\[ \mu_j = \begin{bmatrix} \mu_j^x \\ \mu_j^y \end{bmatrix} \tag{3.20} \]

as shown in Kambhatla’s work on Gaussian mixture models for statistical data processing
[43]. Equation 3.19 is referred to as the conversion function [37].
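The conversion function can be illustrated with a short NumPy sketch. It is a sketch under assumptions rather than the implementation used here: the name `gmm_convert` is hypothetical, each mean is assumed to be stored as the stacked vector of source and target means, and the class posterior P(j|x) is assumed to be computed from the marginal density of x under each component.

```python
import numpy as np

def gmm_convert(x, weights, means, covs):
    """Estimate the target vector as the posterior-weighted sum of the
    per-component regressions.

    x: source vector (p,), weights: (M,), means: (M, 2p) stacked [mu_x; mu_y],
    covs: (M, 2p, 2p) joint covariances partitioned into xx, xy, yx, yy blocks.
    """
    p = x.shape[0]
    M = len(weights)
    log_resp = np.empty(M)
    cond_means = np.empty((M, means.shape[1] - p))
    for j in range(M):
        mu_x, mu_y = means[j, :p], means[j, p:]
        S_xx, S_yx = covs[j, :p, :p], covs[j, p:, :p]
        diff = x - mu_x
        sol = np.linalg.solve(S_xx, diff)            # S_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(S_xx)
        # log of P(j) times the marginal Gaussian density of x for class j
        log_resp[j] = np.log(weights[j]) - 0.5 * (
            p * np.log(2.0 * np.pi) + logdet + diff @ sol)
        cond_means[j] = mu_y + S_yx @ sol            # regression for class j
    resp = np.exp(log_resp - log_resp.max())         # stable normalization
    resp /= resp.sum()                               # P(j | x)
    return resp @ cond_means
```

Working with log posteriors and subtracting the maximum before exponentiating avoids underflow when x lies far from most components.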


3.3 Transformation 

In the transformation mode, the system analyzes a test speech file of the source speaker
and transforms the extracted features x to $\hat{y}$, an estimate of the target speaker’s LSF
parameters.

For each frame we calculate the transformed spectral envelope by converting the
estimated LSF parameters back to LPC filter coefficients, which in turn are used for
synthesis. In addition, pitch scaling is employed to match the pitch of the source
speaker to that of the target speaker.

3.4 Voice Conversion using Factor Analysis 

Although the GMM has become quite popular in recent times for modelling complex
probability densities, it has a few shortcomings. The use of a GMM with full covariance
matrices leads to a huge number of parameters for a high-dimensional input
space and presents the risk of over-fitting. At the other extreme, the covariance matrices
can be constrained to be diagonal. This constraint leads to a model in which the axes of
the Gaussians are aligned with the data axes and which does not capture correlation
amongst the variables. Thus, each of these parameterizations has its disadvantages.
With full covariance matrices, each EM step requires $O(NMd^2)$ operations, where N
is the number of vectors in data space, M is the number of components in the mixture
model and d is the dimension of the data space. Diagonal covariance matrices limit
the computational complexity to O(NMd) and reduce the amount of data needed for
reliable estimation.

A compromise between these extremes can be found in the recently introduced mix- 
ture of latent variable models [48, 49] which form a mixture of constrained Gaussians. 
The advantage of using mixtures of latent variable models is that one can avoid the 
constraint of aligned axes (thus capturing correlations) without needing a full covari- 
ance matrix. This can be done by using the freedom we have in choosing the dimension 
of the so-called latent space: the covariance matrices of the Gaussians are specified and
controlled through a mapping from this latent space to the data space. This idea is 
illustrated in Figure 3.3. 



Figure 3.3: A generative model from a latent space of dimension 2 to a data space of 
dimension 3 

A latent variable model relates a d-dimensional observed data vector z to a q-dimensional
(q < d) latent vector f by defining a noise model and a prior on the
distribution of the latent variables. Recently, there has been a great deal of research
on the topic of local dimensionality reduction, resulting in several variants of the basic
concept with successful applications to character recognition [50]. The algorithm used
by these authors for dimensionality reduction is Principal Component Analysis (PCA).
PCA, unlike maximum likelihood factor analysis (FA), does not define a proper density
model for the data [52]. Furthermore, PCA is not robust to independent noise in the
features of the data. The Mixture of Factor Analyzers (MFA), first proposed by Zoubin
Ghahramani [49] and then by McLachlan [53], can be used to overcome the inflexibility of
GMMs and achieve local dimensionality reduction in each cluster. Comparison studies
by Moerland [51] showed that Mixtures of Factor Analyzers outperform Mixtures of
Principal Component Analyzers [48].

3.4.1 Factor Analysis 

In maximum likelihood factor analysis, a d-dimensional real-valued data vector z is
modelled using a q-dimensional vector of real-valued factors, f, where q is generally
much smaller than d [54]. The generative model is given by:

\[ z = \Lambda f + \mu + \epsilon \tag{3.21} \]

where $\Lambda$ is known as the factor loading matrix. The factors f are assumed to be N(0, I)
distributed (zero-mean independent normals, with unit variance). The d-dimensional
random variable $\epsilon$ is distributed $N(0, \Psi)$, where $\Psi$ is a diagonal matrix. The diagonality
of $\Psi$ is one of the key assumptions of factor analysis: the observed variables
are independent given the factors. The term $\mu$ represents the non-zero mean which
the data can assume. Under these assumptions, the observations z are Gaussian with
mean

\begin{align*}
E[z] &= E[\Lambda f + \mu + \epsilon] \tag{3.22} \\
     &= \Lambda E[f] + E[\mu] + E[\epsilon] \tag{3.23} \\
     &= \mu \tag{3.24}
\end{align*}

and model covariance C,

\begin{align*}
C &= E\left[ (z - \mu)(z - \mu)^T \right] \tag{3.25} \\
  &= E\left[ (\Lambda f + \epsilon)(\Lambda f + \epsilon)^T \right] \tag{3.26} \\
  &= \Lambda E[f f^T] \Lambda^T + E[\epsilon f^T] \Lambda^T + \Lambda E[f \epsilon^T] + E[\epsilon \epsilon^T] \tag{3.27} \\
  &= \Lambda \Lambda^T + \Psi \tag{3.28}
\end{align*}


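Given $\Lambda$ and $\Psi$, the expected value of the factors can be computed through the linear
projection

\[ E[f \mid z] = \beta (z - \mu) \tag{3.29} \]

where $\beta = \Lambda^T (\Psi + \Lambda \Lambda^T)^{-1}$, a fact that results from the joint normality of data and
factors,

\[ P\left( \begin{bmatrix} z \\ f \end{bmatrix} \right) = N\left( \begin{bmatrix} \mu \\ 0 \end{bmatrix},\; \begin{bmatrix} \Lambda \Lambda^T + \Psi & \Lambda \\ \Lambda^T & I \end{bmatrix} \right) \tag{3.30} \]

where I is an identity matrix. Furthermore, it is possible to compute the second moment
of the factors [49],

\begin{align*}
E[f f^T \mid z] &= \mathrm{Var}(f \mid z) + E[f \mid z]\, E[f \mid z]^T \tag{3.31} \\
                &= I - \beta \Lambda + \beta (z - \mu)(z - \mu)^T \beta^T \tag{3.32}
\end{align*}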


With this model, the common factors f account for the statistical dependencies 
between the individual variables of z, and the specific factors e explain small distur- 
bances in each individual random variable of z. In our model for voice conversion, the 
common factors f capture the correlations between LSFs, and the specific factors capture
the sensor noise about each individual LSF. Capturing these correlations allows
us to eliminate the redundancies in the LSFs. 
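The posterior moments of the factors can be made concrete with a small NumPy sketch. It is a minimal illustration, assuming a loading matrix `Lam` of size d x q and a vector `Psi_diag` holding the diagonal of the noise covariance; the function name `fa_posterior` is hypothetical.

```python
import numpy as np

def fa_posterior(z, Lam, Psi_diag, mu):
    """Posterior mean and second moment of the factors for one observation z.

    Lam: d x q factor loading matrix, Psi_diag: (d,) noise variances,
    mu: (d,) data mean.
    """
    C = Lam @ Lam.T + np.diag(Psi_diag)        # model covariance of z
    # beta = Lam^T C^{-1}; C is symmetric, so solve(C, Lam)^T gives it.
    beta = np.linalg.solve(C, Lam).T
    Ef = beta @ (z - mu)                       # E[f | z]
    q = Lam.shape[1]
    Eff = np.eye(q) - beta @ Lam + np.outer(Ef, Ef)   # E[f f^T | z]
    return Ef, Eff
```

A useful sanity check is that the posterior covariance `Eff - np.outer(Ef, Ef)` equals the inverse of I + Lam^T Psi^{-1} Lam, which follows from the matrix inversion lemma and only requires inverting a q x q matrix.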


3.4.2 Mixture of Factor Analyzers (MFA) 

In this work, we follow the EM algorithm proposed by Ghahramani [49]. We give a 
brief overview of the mixture of factor analyzers in this section. 

We assume we have a mixture of M factor analyzers indexed by $\omega_j$, j = 1, ..., M.
The generative model now obeys the following mixture distribution:

\[ P(z) = \sum_{j=1}^{M} \int P(z \mid f, \omega_j)\, P(f \mid \omega_j)\, P(\omega_j)\, df \tag{3.33} \]

As in regular factor analysis, the factors are assumed to be N(0, I) distributed,
therefore

\[ P(f \mid \omega_j) = P(f) = N(0, I) \tag{3.34} \]

The idea behind the mixture of factor analyzers is illustrated in Figure 3.4. Each
factor analyzer in the mixture has a different mean $\mu_j$ and covariance $\Sigma_j$. Therefore,

\[ P(z \mid f, \omega_j) = N(\mu_j + \Lambda_j f,\; \Psi) \tag{3.35} \]

As discussed in §3.2.1, data z is given by the Expression 3.3. 

Figure 3.4: The mixture of factor analysis generative model

The parameters of this model are $\{(\mu_j, \Lambda_j)_{j=1}^{M}, \pi, \Psi\}$; the vector $\pi$ parameterizes
the adaptable mixing proportions, $\pi_j = P(\omega_j)$. The latent variables in this model
are the factors f and the mixture indicator variable $\omega$, where $\omega_j = 1$ when the data
point was generated by $\omega_j$. For the E-step of the EM algorithm, one needs to compute
expectations of all the interactions of the hidden variables that appear in the log
likelihood. The following statements hold true and can be verified,

\[ E[\omega_j f \mid z_i] = E[\omega_j \mid z_i]\; E[f \mid \omega_j, z_i] \tag{3.36} \]

\[ E[\omega_j f f^T \mid z_i] = E[\omega_j \mid z_i]\; E[f f^T \mid \omega_j, z_i] \tag{3.37} \]

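Defining

\[ h_{ij} = E[\omega_j \mid z_i] \propto \pi_j\, N(z_i - \mu_j,\; \Lambda_j \Lambda_j^T + \Psi) \tag{3.38} \]

and using Equations 3.29 and 3.36 we obtain

\[ E[\omega_j f \mid z_i] = h_{ij}\, \beta_j (z_i - \mu_j) \tag{3.39} \]

where $\beta_j = \Lambda_j^T (\Psi + \Lambda_j \Lambda_j^T)^{-1}$. Similarly, using Equations 3.32 and 3.37, the following
expression is obtained

\[ E[\omega_j f f^T \mid z_i] = h_{ij} \left( I - \beta_j \Lambda_j + \beta_j (z_i - \mu_j)(z_i - \mu_j)^T \beta_j^T \right) \tag{3.40} \]

The expected log likelihood for the mixture of factor analyzers is

\[ Q = E\left[ \log \prod_{i} \prod_{j} \left[ (2\pi)^{-d/2}\, |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2} [z_i - \mu_j - \Lambda_j f]^T \Psi^{-1} [z_i - \mu_j - \Lambda_j f] \right\} \right]^{\omega_j} \right] \tag{3.41} \]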
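The posterior component probabilities h_ij used in the E-step can be sketched as follows. This is an illustrative sketch, assuming that the h_ij are normalized over the components for each data point and that the noise covariance is shared and stored as a vector of diagonal entries; the function name `mfa_responsibilities` is hypothetical.

```python
import numpy as np

def mfa_responsibilities(Z, pis, mus, Lams, Psi_diag):
    """Posterior probability of each factor analyzer for each data point.

    Z: (N, d) data, pis: (M,) mixing proportions, mus: (M, d) means,
    Lams: (M, d, q) loading matrices, Psi_diag: (d,) noise variances.
    Returns an (N, M) array whose rows sum to one.
    """
    N, d = Z.shape
    M = len(pis)
    log_h = np.empty((N, M))
    for j in range(M):
        C = Lams[j] @ Lams[j].T + np.diag(Psi_diag)    # component covariance
        diff = Z - mus[j]
        maha = np.sum(diff * np.linalg.solve(C, diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(C)
        log_h[:, j] = np.log(pis[j]) - 0.5 * (
            d * np.log(2.0 * np.pi) + logdet + maha)
    log_h -= log_h.max(axis=1, keepdims=True)          # guard against underflow
    h = np.exp(log_h)
    return h / h.sum(axis=1, keepdims=True)
```

For clarity the sketch forms the full d x d covariance of each component; a production implementation would more likely exploit the low-rank-plus-diagonal structure.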

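The EM algorithm for mixtures of factor analyzers is briefly discussed:

E-Step: Compute $E[f \mid z_i, \omega_j]$ and $E[f f^T \mid z_i, \omega_j]$ for all data points i and mixture
components j.

M-Step: The reestimation formulae for $\tilde{\Lambda}_j$ and $\pi_j$ are obtained by maximizing
Equation 3.41. These are listed below, for the augmented loading matrix $\tilde{\Lambda}_j = [\Lambda_j \;\; \mu_j]$:

\[ \tilde{\Lambda}_j^{new} = \left( \sum_{i=1}^{N} h_{ij}\, z_i\, E[\tilde{f} \mid z_i, \omega_j]^T \right) \left( \sum_{l=1}^{N} h_{lj}\, E[\tilde{f} \tilde{f}^T \mid z_l, \omega_j] \right)^{-1} \tag{3.42} \]

where

\[ E[\tilde{f} \mid z_i, \omega_j] = \begin{bmatrix} E[f \mid z_i, \omega_j] \\ 1 \end{bmatrix} \tag{3.43} \]

and

\[ E[\tilde{f} \tilde{f}^T \mid z_l, \omega_j] = \begin{bmatrix} E[f f^T \mid z_l, \omega_j] & E[f \mid z_l, \omega_j] \\ E[f \mid z_l, \omega_j]^T & 1 \end{bmatrix} \tag{3.44} \]

and the mixing proportions are

\[ \pi_j^{new} = \frac{1}{N} \sum_{i=1}^{N} h_{ij} \tag{3.45} \]

The reestimation formula for $\Psi$ is given by the expression

\[ \Psi^{new} = \frac{1}{N}\, \mathrm{diag}\left\{ \sum_{i,j} h_{ij} \left( z_i - \tilde{\Lambda}_j^{new} E[\tilde{f} \mid z_i, \omega_j] \right) z_i^T \right\} \tag{3.46} \]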

The mixture of factor analyzers is, in essence, a reduced dimensionality mixture
of Gaussians. Each factor analyzer fits a Gaussian to a portion of the data, weighted
by the posterior probabilities, $h_{ij}$. Since the covariance matrix for each Gaussian is
specified through the lower dimensional factor loading matrices, the model has Mqd + d,
rather than Md(d + 1)/2, parameters dedicated to modelling covariance structure.
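To give a feel for the two parameter counts, the short calculation below evaluates both expressions. The values M = 8 and q = 8 match the settings used later in the evaluation, while d = 32 is only an assumed joint dimension chosen for illustration; the function names are hypothetical.

```python
def mfa_covariance_params(M, q, d):
    """M loading matrices of size d x q plus one shared diagonal noise matrix."""
    return M * q * d + d

def gmm_covariance_params(M, d):
    """M symmetric d x d covariance matrices."""
    return M * d * (d + 1) // 2

# d = 32 is an assumed joint dimension, not a value taken from the text.
M, q, d = 8, 8, 32
print(mfa_covariance_params(M, q, d), gmm_covariance_params(M, d))
```

Under these assumed values the MFA uses 2080 covariance parameters against 4224 for the full covariance GMM, and the gap widens as d grows because the GMM count is quadratic in d.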

3.4.3 Use of Unaligned Data to Improve Performance

The methods discussed in the previous sections require source data together with
its corresponding aligned target data. In many applications of voice conversion,
however, more data is available from the source speaker than from the target speaker
at training time. For example, when personalizing the output of a text-to-speech
synthesizer, any number of sentences can be generated for the source speaker, whereas
the target data is fixed. This section discusses a method to incorporate this extra
unaligned data in the training phase.

Previous studies [56] have shown that including unlabelled data in classification
problems leads to an increase in classification performance. We apply this
idea to the voice conversion problem.

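The expected log likelihood function for the mixture of factor analyzers in Equation
3.41 is modified to include unaligned data $x_k$.

\[ Q = E\left[ \log \prod_{i,j} P(x_i, y_i \mid f, \omega_j)^{\omega_j} \prod_{k,j} P(x_k \mid f, \omega_j)^{\omega_j} \right] \tag{3.47} \]

Substituting for $P(x_i, y_i \mid f, \omega_j)$ and $P(x_k \mid f, \omega_j)$ from Equation 3.35 we get,

\begin{align*}
Q = E\Bigg[ \log &\prod_{i,j} \left[ (2\pi)^{-d/2}\, |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2} [z_i - \mu_j - \Lambda_j f]^T \Psi^{-1} [z_i - \mu_j - \Lambda_j f] \right\} \right]^{\omega_j} \\
&\prod_{k,j} \left[ (2\pi)^{-d/4}\, |\Psi_x|^{-1/2} \exp\left\{ -\tfrac{1}{2} [x_k - \mu_{jx} - \Lambda_{jx} f]^T \Psi_x^{-1} [x_k - \mu_{jx} - \Lambda_{jx} f] \right\} \right]^{\omega_j} \Bigg] \tag{3.48}
\end{align*}

To initialize the model, the parameters of the MFA model are first estimated from the
aligned data as in §3.4.2. The following assumption is made for simplicity: the matrix
$\Psi$, i.e. the covariance structure of $\epsilon$, is not affected by the addition of extra unaligned
data. An EM algorithm can now be derived from Expression 3.48 to recalculate the
means, factor loading matrices and the weights without modifying $\Psi$. The problem
now consists of obtaining a better estimate of the mean vectors $\mu_j$, the factor loading
matrices $\Lambda_j$ and the mixing weights $P(\omega_j)$.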

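To jointly estimate the means $\mu_j$ and the factor loading matrices $\Lambda_j$, let

\[ \tilde{f} = \begin{bmatrix} f \\ 1 \end{bmatrix} \tag{3.49} \]

\[ \tilde{\Lambda}_j = \begin{bmatrix} \tilde{\Lambda}_{jx} \\ \tilde{\Lambda}_{jy} \end{bmatrix} = \begin{bmatrix} \Lambda_{jx} & \mu_{jx} \\ \Lambda_{jy} & \mu_{jy} \end{bmatrix} \tag{3.50} \]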


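Rewriting Equation 3.48 with the above substitutions and simplifying, we get

\begin{align*}
Q = E\Bigg[ \log &\prod_{i,j} \left[ (2\pi)^{-d/2}\, |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2} [z_i - \tilde{\Lambda}_j \tilde{f}]^T \Psi^{-1} [z_i - \tilde{\Lambda}_j \tilde{f}] \right\} \right]^{\omega_j} \\
&\prod_{k,j} \left[ (2\pi)^{-d/4}\, |\Psi_x|^{-1/2} \exp\left\{ -\tfrac{1}{2} [x_k - \tilde{\Lambda}_{jx} \tilde{f}]^T \Psi_x^{-1} [x_k - \tilde{\Lambda}_{jx} \tilde{f}] \right\} \right]^{\omega_j} \Bigg] \tag{3.51}
\end{align*}

\begin{align*}
Q = c &- \sum_{i=1}^{N_1} \sum_{j} \Big( \tfrac{1}{2} h_{ij}\, z_i^T \Psi^{-1} z_i - h_{ij}\, z_i^T \Psi^{-1} \tilde{\Lambda}_j E[\tilde{f} \mid z_i, \omega_j] + \tfrac{1}{2} h_{ij}\, \mathrm{tr}\big[ \tilde{\Lambda}_j^T \Psi^{-1} \tilde{\Lambda}_j E[\tilde{f} \tilde{f}^T \mid z_i, \omega_j] \big] \Big) \\
&- \sum_{k=1}^{N_2} \sum_{j} \Big( \tfrac{1}{2} h_{kj}\, x_k^T \Psi_x^{-1} x_k - h_{kj}\, x_k^T \Psi_x^{-1} \tilde{\Lambda}_{jx} E[\tilde{f} \mid x_k, \omega_j] + \tfrac{1}{2} h_{kj}\, \mathrm{tr}\big[ \tilde{\Lambda}_{jx}^T \Psi_x^{-1} \tilde{\Lambda}_{jx} E[\tilde{f} \tilde{f}^T \mid x_k, \omega_j] \big] \Big) \tag{3.52}
\end{align*}

where c is a constant, $h_{ij} = P(\omega_j \mid z_i)$ (from Equation 3.38), $h_{kj} = P(\omega_j \mid x_k)$ is the
corresponding posterior for the unaligned data, and $N_1$ and $N_2$ are the
number of vectors in the aligned data $z_i$ and the unaligned source data $x_k$, respectively.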


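To estimate $\tilde{\Lambda}_{jx}$ and $\tilde{\Lambda}_{jy}$, we set $\partial Q / \partial \tilde{\Lambda}_{jx} = 0$ and $\partial Q / \partial \tilde{\Lambda}_{jy} = 0$.

\begin{align*}
\frac{\partial Q}{\partial \tilde{\Lambda}_{jx}} = &- \sum_{i=1}^{N_1} \Big( -h_{ij}\, \Psi_x^{-1} x_i E[\tilde{f} \mid z_i, \omega_j]^T + h_{ij}\, \Psi_x^{-1} \tilde{\Lambda}_{jx} E[\tilde{f} \tilde{f}^T \mid z_i, \omega_j] \Big) \\
&- \sum_{k=1}^{N_2} \Big( -h_{kj}\, \Psi_x^{-1} x_k E[\tilde{f} \mid x_k, \omega_j]^T + h_{kj}\, \Psi_x^{-1} \tilde{\Lambda}_{jx} E[\tilde{f} \tilde{f}^T \mid x_k, \omega_j] \Big) = 0 \tag{3.53}
\end{align*}

Solving for $\tilde{\Lambda}_{jx}$, we obtain

\[ \tilde{\Lambda}_{jx} = \left( \sum_{i} h_{ij}\, x_i E[\tilde{f} \mid z_i, \omega_j]^T + \sum_{k} h_{kj}\, x_k E[\tilde{f} \mid x_k, \omega_j]^T \right) \left( \sum_{i} h_{ij}\, E[\tilde{f} \tilde{f}^T \mid z_i, \omega_j] + \sum_{k} h_{kj}\, E[\tilde{f} \tilde{f}^T \mid x_k, \omega_j] \right)^{-1} \tag{3.54} \]

Similarly,

\[ \frac{\partial Q}{\partial \tilde{\Lambda}_{jy}} = - \sum_{i=1}^{N_1} \Big( -h_{ij}\, \Psi_y^{-1} y_i E[\tilde{f} \mid z_i, \omega_j]^T + h_{ij}\, \Psi_y^{-1} \tilde{\Lambda}_{jy} E[\tilde{f} \tilde{f}^T \mid z_i, \omega_j] \Big) = 0 \tag{3.55} \]

Therefore,

\[ \tilde{\Lambda}_{jy} = \left( \sum_{i} h_{ij}\, y_i E[\tilde{f} \mid z_i, \omega_j]^T \right) \left( \sum_{i} h_{ij}\, E[\tilde{f} \tilde{f}^T \mid z_i, \omega_j] \right)^{-1} \tag{3.56} \]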

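To reestimate the mixing weights, we use the definition of $P(\omega_j)$ and the empirical
distribution of the data as an estimate of P(x):

\begin{align*}
P(\omega_j) &= \int P(\omega_j \mid x)\, P(x)\, dx \tag{3.57} \\
            &\approx \frac{1}{N_1 + N_2} \left( \sum_{i=1}^{N_1} h_{ij} + \sum_{k=1}^{N_2} h_{kj} \right) \tag{3.58}
\end{align*}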

3.4.4 Spectral Conversion with MFAs 


In order to convert the spectrum in the Factor Analysis case, we calculate the expectation
of the target vector y given the source vector x for the single Factor Analysis model
and then extend the result to a Mixture of Factor Analyzers. To find this expectation,
we partition z so that its joint multivariate density is

\[ P\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = N\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},\; \Sigma \right) \tag{3.59} \]

where $\Sigma$ is the covariance matrix given by

\[ \Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \tag{3.60} \]

From Equation 3.28 the terms $\Sigma_{xx}$ and $\Sigma_{yy}$ are given by

\begin{align*}
\Sigma_{xx} &= \Lambda_x \Lambda_x^T + \Psi_x \tag{3.61} \\
\Sigma_{yy} &= \Lambda_y \Lambda_y^T + \Psi_y \tag{3.62}
\end{align*}

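Equation 3.15 leads to the following expressions for $\Sigma_{xy}$ and $\Sigma_{yx}$

\begin{align*}
\Sigma_{xy} &= \Lambda_x \Lambda_y^T \tag{3.63} \\
\Sigma_{yx} &= \Lambda_y \Lambda_x^T \tag{3.64}
\end{align*}

Using Equation 3.18, the conditional expectation of a joint Gaussian, we find that

\[ E[y \mid x] = \Lambda_y \Lambda_x^T (\Lambda_x \Lambda_x^T + \Psi_x)^{-1} (x - \mu_x) + \mu_y \tag{3.65} \]

Extending the result to the mixture case, the statistical mapping becomes

\[ E[y \mid x] = \sum_{j=1}^{M} P(\omega_j \mid x) \left[ \Lambda_{jy} \Lambda_{jx}^T (\Lambda_{jx} \Lambda_{jx}^T + \Psi_x)^{-1} (x - \mu_{jx}) + \mu_{jy} \right] \tag{3.66} \]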

In fitting a mixture of factor analyzers the modeler thus has two free parameters to
decide: the number of factor analyzers to use (M), and the number of factors in each
analyzer (q). These parameters can be chosen empirically based on the quality of the
synthesized speech.
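The statistical mapping for the mixture of factor analyzers can be sketched in NumPy as shown below. The sketch rests on assumptions that are not stated in the text: the posterior of each analyzer given x is computed from the marginal Gaussian of x under that analyzer, the noise covariance on the source side is shared and diagonal, and the function name `mfa_convert` is hypothetical.

```python
import numpy as np

def mfa_convert(x, pis, mus_x, mus_y, Lams_x, Lams_y, Psi_x_diag):
    """Estimate the target vector from a source vector with an MFA.

    x: (p,) source vector, pis: (M,) mixing proportions,
    mus_x, mus_y: (M, p) source and target means,
    Lams_x, Lams_y: (M, p, q) source and target loading matrices,
    Psi_x_diag: (p,) source-side noise variances.
    """
    p = x.shape[0]
    M = len(pis)
    log_post = np.empty(M)
    cond = np.empty((M, mus_y.shape[1]))
    for j in range(M):
        S_xx = Lams_x[j] @ Lams_x[j].T + np.diag(Psi_x_diag)
        diff = x - mus_x[j]
        sol = np.linalg.solve(S_xx, diff)              # S_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(S_xx)
        log_post[j] = np.log(pis[j]) - 0.5 * (
            p * np.log(2.0 * np.pi) + logdet + diff @ sol)
        # Cross-covariance Lam_y Lam_x^T applied without forming it explicitly.
        cond[j] = mus_y[j] + Lams_y[j] @ (Lams_x[j].T @ sol)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                 # posterior given x
    return post @ cond
```

Because the cross-covariance is a product of two thin matrices, applying it as two successive matrix-vector products costs on the order of pq operations per component instead of p^2.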



Chapter 4 


Evaluation 

4.1 Speech Corpus 

For our research, we chose the Arctic Speech Corpus [11], developed recently
by Carnegie Mellon University (CMU), which is a high-quality speech corpus with a free
software license. It was specifically created for the purpose of speech synthesis.

Since this thesis only deals with spectral conversion, the Arctic corpus suits our 
purpose. The CMU Arctic Speech Corpus is a set of single speaker databases that 
have been carefully recorded under studio conditions. The databases consist of around 
1150 phonetically rich sentences carefully selected from out-of-copyright texts from 
Project Gutenberg [27]. The corpus consists of four primary sets of recordings (3 male, 
1 female). The corpus consists of bit speech ‘wav’ files with a sampling rate of 16
kHz.


4.2 Objective Evaluation


In evaluating our system, we use the objective measure given by Kain [5]. This performance
index is based on a ratio of two measures. The first measure, the transspeaker distance, is
the spectral distance between the converted speech and the target speech, determining
how close the converted speech is to the target speaker’s. The second, the interspeaker
distance, measures the spectral distance between the source and target speaker. To
present the performance index, let us again consider the vector of source speech for the
n-th frame as $x_n$ and the target speaker’s vector as $y_n$. We compute the performance
index P with the following equation

\[ P = 1 - \frac{\sum_{n=1}^{N} D(\hat{y}_n, y_n)}{\sum_{n=1}^{N} D(y_n, x_n)} \tag{4.1} \]

where $\hat{y}_n$ is the converted target vector. Since our feature for representing speech is the
LSF, the distance measure D between any two LSF vectors a and b with dimension p
is Euclidean.

\[ D(a, b) = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2} \tag{4.2} \]

We interpret a performance index close to 1 as a good conversion, while an index close to 0
indicates a poor one. Equation 4.1 is close to 1 when the source and target speakers
have different spectral properties and when the converted speech is close to the target
speaker’s speech. Thus, according to the objective measure, it is difficult for the system
to perform well if the source and target speakers’ spectral properties are already similar.
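A minimal sketch of the objective measure is given below. It assumes that the index is one minus the ratio of the summed transspeaker distances to the summed interspeaker distances, which is what the interpretation above implies, and that D is the plain Euclidean distance; the function names are hypothetical.

```python
import math

def lsf_distance(a, b):
    """Euclidean distance between two LSF vectors of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def performance_index(converted, target, source):
    """One minus the ratio of transspeaker to interspeaker distance.

    Each argument is a sequence of time-aligned LSF vectors, one per frame.
    """
    trans = sum(lsf_distance(c, t) for c, t in zip(converted, target))
    inter = sum(lsf_distance(t, s) for t, s in zip(target, source))
    return 1.0 - trans / inter
```

With this definition a system that reproduces the target frames exactly scores 1, and a system that returns the source frames unchanged scores 0.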

Figure 4.1 shows the graph of performance index versus the training time for a
voice conversion system using a mixture of factor analyzers with 8 components. As the
training time increases, the performance of the system rises steeply until it
plateaus around 15 seconds of training data. We conclude that 15 seconds of training
data (about 5 sentences of 3 s each) is enough for the system to give a reasonable
average performance between 0.25 and 0.30.

Figure 4.1: Performance Index versus Training Time for conversion using mixture of
factor analyzers

Figure 4.2 shows the variation of the performance index for different numbers of components
in the mixture model for three cases, viz. mixture of factor analyzers,
mixture of factor analyzers with unaligned data, and full covariance GMM. The
dimension of the latent variable f was chosen as 8 in this experiment. It can be seen
that the GMM performs much better than the MFA when more than 12 components are used.
When unaligned data is used in the EM training, the resulting system outperforms the
previous two cases. The performance drops when the number of components is increased
because of overfitting.

Figure 4.2: Performance Index versus number of components in the mixture model

Figure 4.4 displays an example of envelope conversion using a mixture of
factor analyzers, with and without unaligned data, benchmarked against
the full covariance GMM method. It can be seen that there is a shift in the formants
of the converted envelopes, and the closest fit to the target envelope is achieved when
unaligned data is used in the training phase.

4.3 Subjective Tests 

Subjective listening tests were conducted with seven listeners to assess the recognizability
and quality of the converted speech for the system using a mixture of factor
analyzers with and without unaligned data, along with the system using the GMM. Before
starting the test, we trained each listener on each of the four speakers’ distinct
voice qualities by playing several of their speech files. After this initial training, we
quizzed each listener until they identified each speaker correctly for ten consecutive
trials. Three kinds of listening tests have been conducted: ABX test, speaker discrimination
test and system quality comparison test. After quizzing, we began the test,
playing the same sentence “God your letter came just in time” for different cases. Between
each utterance, we gave the listener 10 seconds to write down his answer; these
time intervals made the entire test last about 50 minutes. We played the same sentence
so that listeners could focus intently on the quality of each recording.

ABX Test: 

To evaluate the accuracy of the conversion, a set of trials was presented to the listeners
using the ABX method. X was either the converted speech using the GMM with 12
components or the converted speech using the MFA, with or without unaligned data, using
8 components. A and B were either the target or the source speaker. Speakers A and
B uttered the same sentence, which in general was different from the sentence uttered
by X. Subjects were asked to select either A or B as being most similar to X. Figure 4.5
summarizes the results from this test, giving the percentage of correct answers. A correct
answer means that the converted/modified speaker was recognized as the target
speaker. Conversion obtained using only the MFA scored 63% and was marginally below
that obtained by the GMM. However, when unaligned data was used in training, correct
answers were received 76% of the time.

Figure 4.5: Results from the ABX Test

System Quality Comparison Test: 

To assess the speech quality of the various voice conversion systems in terms of intelligibility
and naturalness, we compared them against each other. The listeners were asked
to give their preference scores on a scale of 1-5, 1 being the lowest score and 5 being
the highest, for various samples of the converted speech. The listeners’ preferences are
shown in Figure 4.6. The overall quality of the converted signals was considered
quite natural, although some of the listeners reported a muffling effect in some cases.
The listeners gave a higher preference score to the converted speech using the MFA with
unaligned data.

Figure 4.6: System Quality Comparison Test (horizontal axis: Score, 1 to 5)


Speaker Discrimination Test: 

In this test, the listeners were asked to recognize the speakers. The converted speech
samples obtained from the three methods were played randomly. The recognition accuracy
was about 72% for the GMM and the MFA using unaligned data and was marginally
lower for the conversion obtained using only the MFA.

Figure 4.7: System Discrimination Test

Listeners had difficulty in distinguishing between two male speakers in the database.
Listeners were also presented samples of converted speech obtained using different
numbers of dimensions in the factor analyzers. None of the subjects could differentiate
between the samples when the differences in the dimensions were small. When modelling
the probability density with a mixture of factor analyzers, we are thus able to
present a finer resolution of options to the end user of the voice conversion system.
The end user can select with a greater degree of freedom how well the system should
perform depending on the application.





Chapter 5 


Conclusion 

5.1 Summary 

In summary, we have applied the mixture of factor analyzers model to voice conversion.
The MFA model has Mqd + d, rather than Md(d + 1)/2, parameters dedicated to
modelling the covariance structure. The time complexity for training is reduced from
$O(MNd^2)$ to O(MNdq). This reduction in complexity comes with the penalty of a
slight decrease in the objective quality of conversion. It has also been shown that
combined learning with aligned source-target speaker data and unaligned source data
increases the conversion performance. Subjective tests showed that small changes
in the dimension of the factor analyzers did not affect the perception of the speech.
Therefore, the user of the system can select a value for the dimension of the
factor analyzers that suits the performance needs of the application. For voice conversion
training and execution, the mixture of factor analyzers model provides a flexible range
of tradeoffs to select from.

5.2 Future Work

The system can be improved upon in several ways by incorporating recent advances
in variational Bayesian modelling, independent component analysis and online EM
algorithms.

By incorporating variational Bayesian techniques with MFA as described in [57],
the system can automatically determine the optimal number of components and
the local dimensionality of each component (i.e. the number of factors in each factor
analyzer). This method falls in the automatic relevance determination framework
of variational Bayesian techniques. Having this ability implies that we can describe
the nonlinear probability density with an appropriate number of components for the
mixture. Incorporating this technique increases the complexity slightly but solves the
difficult problem of estimating the correct model order and dimensionality of each
component and relieves the end user of these responsibilities.

The EM algorithm used here is a batch algorithm in which the whole training data
is scanned at every iteration to improve its estimation. Including new data in
the model requires a complete reestimation of the model parameters. In addition, the
memory space required by the algorithm should be constant with respect to the number of
data points processed so far. Online EM algorithms can thus be used to overcome these
problems, and the model can be updated online based on new training data.

The assumption that each component is Gaussian may not always hold in natural clustering
problems. One way to address this is to use a mixture of independent component
analyzers as described in [55], where each component’s distribution is non-Gaussian.
With this approach, it may be possible to describe the density of each component in
the mixture more accurately.